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Abstract 

Detecting changes in high-dimensional time series is difficult because 
it involves the comparison of probability densities that need to be esti- 
mated from finite samples. In this paper, we present the first feature 
extraction method tailored to change point detection, which is based on 
an extended version of Stationary Subspace Analysis. We reduce the di- 
mensionality of the data to the most non-stationary directions, which are 
most informative for detecting state changes in the time series. In ex- 
tensive simulations on synthetic data we show that the accuracy of three 
change point detection algorithms is significantly increased by a prior fea- 
ture extraction step. These findings are confirmed in an application to 
industrial fault monitoring. 

1 Introduction 

Change point detection is a task that appears in a broad range of applications 
such as biomedical signal processing (9[ |25l [16] , speech recognition [H [29] , in- 
dustrial process monitoring [3] [23], fault state detection [7] and econometrics 
[5J [32] . The goal of change point detection is to find the time points at which 
a time series changes from one macroscopic state to another. As a result, the 
time series is decomposed into segments |3] of similar behavior. Change point 
detection is based on finding changes in the properties of the data, such as in the 
moments (mean, variance, kurtosis) [3], in the spectral properties [2], temporal 
structure [IB] or changes w.r.t. to certain patterns [2|. The choice of any of these 
aspects depends on the particular application domain and on the statistical type 
of the changes that one aims to detect. 

For a large family of general segmentation algorithms, state changes are 
detected based on comparing the empirical distributions between windows of 
the time series |14l IT51 [IS] . Estimating and comparing probability densities is 
a difficult statistical problem, particularly in high dimensions. However, not 
all directions in the high dimensional signal space are informative for change 
point detection: often there exists a subspace in which the distribution of the 
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Figure 1: Informative vs. uninformative directions for change point detection. 
The left panel shows the observed bivariate time series where no pronounced 
changes are visible. The middle panel shows the two underlying sources, where 
one of them exhibits clearly visible changes. In the right panel, we see that 
the stationary sources has much higher signal power than the informative non- 
stationary sources and thus masks the presence of change points in the observed 
data. 



data remains constant over time (stationary). This subspace is irrelevant for 
change point detection, but increases the overall dimensionality. Moreover, sta- 
tionary components with a high signal power can make change points invisible 
to the observer and also to detection algorithms. For example, there are no 
change points visible in the time series depicted in the left panel of Figure [l] 
even though there exists one direction in the two-dimensional signal space which 
clearly shows two change points, as it can be seen in the middle panel. However, 
the non-stationary contribution is not visible in the observed signal because of 
its relatively low power (right panel). In this example, we also observe that 
it does not suffice to select channels individually, as neither of them appears 
informative. In fact, in many application domains such as biomedical engineer- 
ing [361 I231 122"] or geophysical data analysis [2T| , it is most plausible that the 
data is generated as a mixture of underlying sources that we cannot measure 
directly. 

In this paper we show how to extract useful features for change point detec- 
tion by finding the most non-stationary directions using Stationary Subspace 
Analysis [33] . Even though there exists a wide range of feature extraction meth- 
ods for classification and regression [10 , to date no specialized procedure for 
feature extraction or for general signal processing (T2] has been proposed for 
change point detection. In controlled simulations on synthetic data, we show 
that for three representative change point detection algorithms the accuracy is 
significantly increased by a prior feature extraction step, in particular if the 
data is high dimensional. This effect is consistent over various numbers of di- 
mensions and strengths of change points. In an application to fault monitoring, 
where the ground truth is available, we show that the proposed feature extrac- 
tion improves the performance and leads to a dimensionality reduction where 
the desired state changes are clearly visible. Moreover, we also show that we 
can determine the correct dimensionality of the informative subspace. 
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The remainder of this paper is organized is follows. In the next Section [2] 
we introduce our feature extraction method that is based on an extension of 
Stationary Subspace Analysis. Section [3] contains the results of our simulations 
and in Section [4] we present the application to fault monitoring. Our conclusions 
are outlined in the last Section [5] 

2 Feature Extraction for Change-Point Detection 

Feature extraction from raw high-dimensional data has been shown to be useful 
not only for improving the performance of subsequent learning algorithms on 
the derived features [10 , but also for understanding high-dimensional complex 
physical systems where the relevant information is difficult to identify. In many 
application areas such as Computer Vision |6. , Bioinformatics [30, 23j and text 
classification [19], defining useful features is in fact the main step towards suc- 
cessful machine learning. General feature extraction methods for classification 
and regression tasks are based on maximizing the mutual information between 
features and target |34| . explaining a given percentage of the variance in the 
dataset [3T], choosing features which maximize the margin between classes [20] 
or selecting informative subsets of variables through enumerative search (wrap- 
per methods) [10 . However, for change-point detection no dedicated feature ex- 
traction has been proposed [3J. Unlike in classical supervised feature selection, 
where a target variable allows us to measure the informativeness of a feature, 
for change-point detection we cannot tell whether a feature elicits the changes 
that we aim to detect since there is usually no ground truth available. Even so, 
feature extraction is feasible following the principle that a useful feature should 
exhibit significant distributional changes over time. Reducing the dimensional- 
ity in a pre-processing step should be particularly beneficial to the change-point 
detection task: most algorithms either explicitly or implicitly make approxima- 
tions to probability densities |18| fT5] or directly compute a divergence measure 
based on summary statistics, such as the mean and covariance [3] between seg- 
ments of the time series — both are hard problems whose sample complexities 
grow exponentially with the number of dimensions. 

As we have seen in the example presented in Figure [T] selecting channels 
individually (univariate approach) is not helpful or may lead to suboptimal fea- 
tures. The overall data may be non-stationary notwithstanding the fact that 
each dimension seems stationary. Moreover, a single non-stationary source may 
be expressed across a large number of channels. It is therefore more sensible 
to estimate a linear projection of the data which contains as much information 
relating to change points as possible. In this paper, we demonstrate that finding 
the projection to the most non-stationary direction using Stationary Subspace 
Analysis significantly increases the performance of change-point detection algo- 
rithms. 

In the remainder of this section, we first review the SSA algorithm and show 
how to extend it towards finding the most non-stationary directions. Then we 
show that this approach corresponds to finding the projection that is most likely 
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to be non-stationary in terms of a statistical hypothesis test. 



2.1 Stationary Subspace Analysis 

Stationary Subspace Analysis [35j factorizes a multivariate time series x(t) £ H D 
into stationary and non-stationary sources according to the linear mixing model, 

V(t) 



s(i) = As(t) = [A 5 A n ] 



s n (t)\ ' 
s n (t) are the d n 



(1) 



where s s (t) are the d s stationary sources, s n (t) are the d n (d n + d s = D) 
non-stationary sources and A is an unknown time-constant invertible mixing 
matrix. The spaces spanned by the columns of the mixing matrix A* and A n 
are called the s- and n-spaces respectively. Note that in contrast to Independent 
Component Analysis (ICA) |13j . there is no independence assumption on the 
sources s(t). 

The aim of SSA is to invert the mixing model (Equation[TJ given only samples 
from the mixed sources x(t), i.e. we want to estimate the dcmixing matrix B 
which separates the stationary from the non-stationary sources. Applying B to 
the time series x(t) yields the estimated stationary and non-stationary sources 
s*(t) and s n (t) respectively, 



Bx(t) = 



B s 
B" 



x(t) = 



B 5 A* 
B n A s 



B 5 A n 
B n A n 



(2) 



The submatrices B* € H dsXD and B n G WL^ dn ^ xD of the estimated demixing 
matrix B project to the estimated stationary and non-stationary sources and 
are called s-projection and n-projection respectively. The estimated mixing 
matrix A is the inverse of the estimated demixing matrix, A = B^ 1 . 

The inverse of the SSA model (Equation[lJ is not unique: given one demixing 
matrix B. any linear transformation within the two groups of estimated sources 
leads to another valid separation, because it leaves the stationary resp. non- 
stationary nature of the sources unchanged. But also the separation into s- 
and n-sources itself is not unique: adding stationary components to a non- 
stationary source leaves it non-stationary, whereas the converse is not true. That 
is, the n-projection can only be identified up to arbitrary contributions from the 
stationary sources. Hence we cannot recover the true n-sources, but only the 
true s-sources (up to linear transformations). Conversely, we can identify the 
true n-space (because the s-projection is orthogonal to it) but not the true s- 
space. However, in order to extract features for change-point detection, our 
aim is not to recover the true non-stationary sources, but instead the most 
non-stationary ones. 

An SSA algorithm depends on a definition of stationarity, that the s-projection 
aims to satisfy. In the SSA algorithms |35l ITT] , a time series X t is considered 
stationary if its mean and covariance is constant over time, i.e 

u[x tl ] = nx t2 ] 

nX tl Xl] = E[X t2 Xj 2 ], 
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for all pairs of time points ii,i2 € No- This is a variant of weak stationarity [28] 
where we do not take time structure into account. Following this concept of 
stationarity, the SSA algorithm [35] finds the s-projection B B that minimizes the 
difference between the first two moments of the estimated s-sources s 5 (t) across 
epochs of the time series, since we cannot estimate the mean and covariance at a 
single time point. Thus we divide the samples from x(t) into n non-overlapping 
epochs defined by the index sets 71 , • ■ • , T n C No and estimate the epoch mean 
and covariance matrices, 



IT 



— ^2 ^c*) and 

1 teT 



Ml \T\ 

1 1 teT 



teT 



respectively for all epochs 1 < i < n. Given an s-projection, the epoch mean 
and covariance matrix of the estimated s-sources in the i-th epoch are 



B s fi t and Ef = B s ±i(B*) T . 



The difference in the mean and covariance matrix between two epochs is mea- 
sured using the Kullback-Leibler divergence between Gaussians. The objective 
function is the sum of the difference between each epoch and the average epoch. 
Since the s-sources can only be determined up to an arbitrary linear transfor- 
mation and since a global translation of the data does not change the difference 
between epoch distributions, without loss of generality we center and whiterj^] 
the data such that average epoch's mean and covariance matrix are, 



N N 
j=l i=l 

Moreover, we can restrict the search for the true s-projection to the set of 
matrices with orthonormal rows, i.e. _B s (i? s ) T = /. Thus the optimization 
problem becomes, 

N 

B s = argmin V D KL [^(/tf, E|) Af(0, 1) 
bbt=i i=1 

N 

argmin ]T (- log dct S| + (Af) T A? ) , ( i ) 

T=I i=l 



BB 

which can be solved efficiently by using multiplicative updates with orthogonal 
matrices parameterized as matrix exponentials of antisymmetric matrices |35) 

EZlQ 

A whitening transformation is a basis transformation W that sets the sample covariance 

« _ l 

matrix to the identity. It can be obtained from the sample covariance matrix S as W = E 2 . 

2 An efficient implementation of SSA may be downloaded free of charge at http://www. 
stationary- subspace- analysis . org/toolbox 
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2.2 Finding the Most Non-Stationary Sources 



In order to extract useful features for change-point detection, we would like to 
find the projection to the most non-stationary sources. However, the SSA algo- 
rithms ITT] merely estimate the projection to the most stationary sources, 
and choose the projection to the non-stationary sources to be orthogonal to the 
found s-projection, which means that all stationary contributions are projected 
out from the estimated n-sources. The justification for this choice is that it 
maximizes the non-stationarity of the n-sources in the case where the covari- 
ance between the true n- and s-sources is constant over time. This, however, 
may not always be the case: significant non-stationarity may well be contained 
in changing covariance between s- and n-sources. In fact we observe this in our 
application to fault monitoring. Thus, in order to find the most non-stationary 
sources, we also need to optimize the n-projection. Before we turn to the opti- 
mization problem, let us first of all analyze the situation more formally. 




Figure 2: The left panel shows two epoch covariance matrices Ei and £2 where 
the non-stationarity is confined to changes in the variance along one direction, 
hence the most non-stationary projection B n is orthogonal to the true stationary 
projection B 5 . This is not the case in the situation depicted in the right panel: 
here, the covariance of the two dimensions changes between Si and £2, so that 
we can find a non-stationary projection that is more non-stationary than the 
orthogonal complement of the true s-projection. 



We consider first a simple example where we have one stationary and one 
non-stationary source with corresponding normalized basis vectors ||^4 5 || = 1 
and ||^4"|| = 1 respectively, and let 4> be the angle between the two spaces, 
i.e. coscf> — A sT A a . We will consider an arbitrary pair of epochs, 71 and T2, 
and show which projection B" maximizes the difference in mean and variance 
A a between 71 and 72- 

Let X\ and X2 be bivariate random variables modeling the distribution of 
the data in the two epochs respectively. According to the linear mixing model 
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(Equation [lj) , we can write X\ and X 2 in terms of the underlying sources, 

X 1 = A S X S + A n X ni 
X 2 = A S X S + A n X, l2 

where the univariate random variable X s represents the stationary source and 
the two univariate random variables X ni and X n . 2 model the non-stationary 
sources, in the epochs 71 and T2 respectively. Without loss of generality, we 
will assume that the true s-projection B s = (A n ) is normalized, = 1. In 

order to determine the relationship between the true s-projection and the most 
non-stationary projection, we write it in terms of B s and A", 

B n = aB* + pA nT , (5) 

with coefficients a, (3 £ R such that ||-B n || = 1. In the next step, we will observe 
which n-projection maximizes the difference in mean A M and covariance A CT 
between the two epochs 7i and 7~2- Let us first consider the difference in the 
mean of the estimated n-sources, 

A M = E[B n Xx] - E[B n X 2 ] = B n A n (E[X ni ] - E[X„ 2 ]). 

This is maximal for B n A n — 1, i.e. when B n is orthogonal to B s . Thus, with 
respect to the difference in the mean, choosing the n-projection B a to be orthog- 
onal to the s-projection is always optimal, irrespective of the type of distribution 
change between epochs. 

Let us now consider the difference in variance A CT of the estimated n-sources 
between epochs. This is given by, 

A a = Var[B n X!] - V&r[B n X 2 ] = p 2 (Var[X ni ] - Var[X„ 2 ]) 

(Cov[X s ,X ni ]-Cov[X s ,X n2 }). 

Clearly, when there is no change in the covariance of the s- and the n-sources 
between the two epochs, i.e. A <Jsil = O^the difference A CT is maximized for 
B n = (B 5 )- 1 -. See the left panel of Figure^for an example. However, when the 
covariance between s- and n-sources does vary, i.e. |A - si J > 0, the projection 
(B 3 ) 1 - is no longer the most non-stationary. To see this, consider the derivative 
of An. with respect to the a at a — 0, 

dA a /da\ a=0 = 2 cos (0 + |) A CTs „ . 

Since this derivate does not vanish, a — (see Equation |5| is not an extremum 
when |A sn | > 0, which means that the most non-stationary projection is not 
orthogonal to the true s-projection. This is the case in the right panel of Fig- 
ure m 



(3 cos c/> 
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Thus, in order to find the projection to the most non-stationary sources, 
we also need to maximize the non-stationarity of the estimated n-sources. To 
that end, we simply maximize the SSA objective function (Equation [4]) for the 
n-projection, 

N 

B n = argmax V (- log det E" + (tf) T ft) , (6) 

BB^=I i=1 V ' 

where E? = B n Y, t (B n ) T and $ = B*(h for all epochs 1 < i < N. 



2.3 Relationship to Statistical Testing 

In this section we show that maximizing the SSA objective function to find the 
most non-stationary sources can be understood from a statistical testing point- 
of-view, in that it also maximizes the p-value for rejecting the null hypothesis 
that the estimated directions are stationary. 

More precisely, we maximize the p- value for a statistical hypothesis test that 
compares two models for the data: the null hypothesis H that each epoch 
follows a standard normal distribution vs. the alternative hypothesis Ha that 
each epoch is Gaussian distributed with individual mean and covariance matrix. 
Let X\, ... ,Xjsr be random variables modeling the distribution of the data in 
the N epochs. Formally, the hypothesis can be written as follows. 

H A ;X X ~ M{ji u E x ), ... t X N ~ Af(fijy, Eat) 

In other words, the statistical test tells us whether we should reject the simple 
model Hq in favor of the more complex model Ha ■ This decision is based on the 
value of the test statistics, whose distribution is known under the null hypothesis 
Hq. Since Hq is a special case of Ha and since the parameter estimates are 
obtained by Maximum Likelihood, we can use the likelihood ratio test statistic 
A [33], which is the ratio of the likelihood of the data under Hq and Ha, where 
the parameters are their maximum likelihood estimates. 

Let X C R dn be the data set which is divided into N epochs 7i , . . . , T n and 
let /t", ... , /Ujy and E", . . . , E 1 ^ be the maximum likelihood estimates of the mean 
and covariance matrices of the estimated n-sources respectively. Let pj^(x; /i, E) 
be the probability density function of the multivariate Gaussian distribution. 
The likelihood ratio test statistic is given by 

A(Af) = —2 log n^Pv(x;0,7) 

which is approximately % 2 distributed with ^Nd n (d n + 3) degrees of freedom. 
Using the facts that we have set the average epoch's mean and covariance matrix 
to zero and the identity matrix respectively, i.e. 

N N 

-5>'i=0 and ^E^" = / ' (8) 

i=l i=i 
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the test statistic simplifies to, 

, 1 N 

A(X) =--± N +-Y,Ni(- logdet + || A, n || 2 + tr(E?)) , (9) 

i=l 

where — \%\ is the number of data points in the i-th epoch. If every epoch 
contains the same number of data points (N± = • • ■ = Nn), then maximizing 
the SSA objective function (Equation |6| is equivalent to maximizing the test 
statistic (Equation [9]) and hence minimizing the p- value for rejecting the simple 
(stationary) model for the data. 

As we will see in the application to fault monitoring (Section [4]), the p- value 
of this test furnishes a useful indicator of the number of informative direc- 
tions for change point detection. More specifically, we increase the number of 
candidate stationary sources until the test returns that they are significantly 
non-stationary. As long as the p— value is large, we may safely conclude that 
these estimated stationary sources (B*s(t)) are stationary and that they may 
be removed without loss of informativeness for change point detection. In the 
specific case of change point detection, removing additional directions may lead 
to increases in performance as a result of the reduced dimensionality: this is of 
course a data dependent question. 

In Figure [3] we illustrate the results of the procedure for choosing d s , the 
number of stationary sources. We use the dataset described below in Section [3] 
The procedure for choosing d s is as follows: for each d s from 1, . . . ,D, where D 
is the dimensionality of the data set we compute the projection to the stationary 
sources. For each d s we calculate the test statistic A(A"). We then choose d s 
to be the largest such d' s such that we do not reject the null hypothesis, at the 
p = a confidence level, given in Equation [7] We display the p- values obtained 
using SSA for a fixed value for simulated data's dimensionality D — 10 and 
the number of stationary sources ranging from d\ — 1, . . . , 9 and the chosen 
parameter ranging from 1, ... ,9. The confidence level a = 0.01 for rejection of 
the null hypothesis Ho = The projected data is stationary returns the correct 
d s on average in all cases. 



3 Simulations 

In this section we demonstrate the ability of SSA to enhance the segmentation 
performance of three change-point detection algorithms on a synthetic data 
setup. The algorithms are single linkage clustering with divergence (SLCD) [8 
which uses the mean and covariance as test statistics, CUSUM [26 , which uses 
a sequence of hypothesis tests and the Kohlmorgen/Lemm [TH], which uses a 
kernel density measure and a hidden Markov model. For each segmentation 
algorithm we compare the performance of the baseline case in which the dataset 
is segmented without preprocessing, the case in which the data is preprocessed 
by projecting to a random subspace and the case in which the dataset is pre- 
processed using SSA. We compare performance with respect to the following 
schemes of parameter variation: 
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Figure 3: Average p-values obtained over 100 realizations of the dataset for 
each setting of the true d s . The number of dimensions is D = 10 and the y- 
axis displays the actual number for d s . The x— axis displays the value for the 
parameter d s used to compute the stationary projection using SSA. The red 
box shows the decision made at the p = 0.01 confidence level. The red box 
displays the choice made using this decision rule for choosing the parameter d s 
which occurs most often. The results show that the method picks the correct 
parameter on averages, over simulations. 



1. The dimensionality D of the time series is fixed and d n , the number of 
non-stationary sources is varied. 

2. The number d n of non-stationary sources is fixed and d s , the number of 
the stationary sources is varied. 

3. D. d n and d s are fixed and the power p between the changes in the non- 
stationary sources is varied. 

For two of the change point algorithms which we test, SLCD and Kohlmor- 
gen/Lemm, all three parameter variation schemes are tested. For CUSUM the 
second scheme does not apply as the method is a univariate method. 

For each setup and for each realization of the dataset we repeat segmentation 
on the raw dataset, the estimated non-stationary sources after SSA preprocess- 
ing for that dataset and on a d n dimensional random projection of the dataset. 
The random projection acts as a comparison measure for the accuracy of the 
SSA-estimated non-stationary sources for segmentation purposes. 

3.1 Synthetic Data Generation 

The synthetic data which we use to evaluate the performance of change point 
detection methods is generated as a linear mixture of stationary and non- 
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Table 1: Overview of simulations performed and corresponding figures reporting 
the results. A tick denotes that the corresponding parameter was varied in the 
experiment. A cross denotes that the corresponding parameter is kept fixed. 
("P." denotes panel within the respective figures and "Fi." denotes the figure.) 



stationary sources. The data is further generated epoch wise: each epoch has 
fixed length and each dataset consists of a concatenation of epochs. The d sta- 
tionary sources are distributed normally on each epoch according to A^(0, Id s )- 
The other d n (non-stationary) source signals s n (t) are distributed according 
to the active model k of this epoch; this active model is one of five Gaussian 
distributions Qk = A/"(0, £&): the covariance is a diagonal matrix whose 
eigenvalues are chosen at random from five log-spaced values between a\ = l/p 
and erf = p; thus five covariances, corresponding to the Gk of the Markov chain 
are then chosen in this way. The transition between models over consecutive 
epochs follows a Markov model with transition probabilities: 

I 0.025 i i= j. 
3.2 Performance Measure 

In our experiments we evaluate the algorithms based on an estimation of the 
area under the ROC curves (AUC) across realizations of the dataset. The true 
positive rate (TPR) and false positive rate (FPR) are defined with respect to 
the fixed epochs with respect to which we generate the synthetic dataset; a 
changepoint may only occur between two such epochs of fixed length. Each of 
the changepoint algorithms, which we test, reports changes with respect to the 
same division into epochs as per the synthetic dataset: thus the TPR and FPR 
are well defined. 

We use the AUC because it provides information relating to a range of TPR 
and FPR. In signal detection the tradeoff achieved between TPR and FPR de- 
pends on operational constraints: cancer diagnosis procedures must achieve a 
high TPR perhaps at the cost of a higher than desirable FPR. Network intrusion 
detection, for instance, may need to compromise the TPR given the computa- 
tional demands set by too high an FPR. In order to assess detection performance 
across all such requirements the AUC provides the most informative measure: 
all tradeoffs are integrated over. 

More specifically, each algorithm is accompanied by a parameter r which 
controls the trade off between TPR and FPR. For SLCD this is the number of 
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clusters, for CUSUM this is the threshold set on the log likelihood ratio and 
for the Kohlmorgen/Lemm this is the parameter controlling how readily a new 
state is assigned to the model. 



3.3 Single Linkage Clustering with Symmetrized Diver- 
gence Measure (SLCD) 

Single Linkage Clustering with a symmetrized distance measure is a simple algo- 
rithm for change point detection which has, however, the advantage of efficiency 
and of segmentation based on a parameter independent distance matrix (thus 
detection may be repeated for differing tradeoffs between TPR and FPR with- 
out reevaluating the distance measure). In particular, segmentation based on 
Single Linkage Clustering [8] computes a distance measure based on the covari- 
ance and mean over time windows to estimate the occurrence of changepoints: 
the algorithm consists of the following three steps. 



1. The time series is divided into 200 epochs for which we estimate the epoch- 



mean and epoch-covariance matrices {(Ai) 



The dissimilarity matrix D G ^200x200 j-) e t ween t ne epochs is computed as 
the symmetrized Kullback-Leibler divergence Z?kl between the estimated 
distributions (up to the first two moments) , 



where A/"(/x, E) is the Gaussian distribution. 

Based on the dissimilarity matrix D, Single Linkage Clustering [5j (with 
number of clusters set to k = 5) returns an assignment of epochs to clusters 
such that a changepoint occurs when two neighbouring epochs do not 
belong to the same cluster. 



3.3.1 Results 

The results of the simulations for varying numbers of non-stationary sources in 
a dataset of 30 channels are shown in Figure [5] in the first panel. When the 
degree to which the changes are visible is lower, the SSA preprocessing signifi- 
cantly outperforms the baseline method, even for a small number of irrelevant 
stationary sources. 

The results of the simulations for a varying number of stationary dimensions 
with 2 non-stationary dimensions are displayed in Figure [5] in the second panel. 
For small d the performance of the baseline and SSA preprocessing are similar: 
SSA's performance is more robust with respect to the addition of higher num- 
bers of stationary sources, i.e. noise directions. The segmentations produced 
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Stationary (1 -4) and non-stationary (5-6) sources + true changepoints (red) 
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Figure 4: An Illustration of a case in which SSA significantly improves single 
linkage clustering with divergence: d s = 4 (no. of stat. sources), d n = 2 (no. 
non-stat.), p = 2.3 (power change in non-stat. sources). The top panel displays 
the true decomposition into stationary and non-stationary sources with the true 
changepoints marked. The middle panel displays the changepoints which SLCD 
finds on the entire data set (sources mixed) : clearly some changepoints are left 
undetected. The bottom panel displays the changepoints found by SLCD on 
the estimated non-stationary sources. 



using SSA preprocessing continue to carry information relating to changepoints 
for d s = 30, whereas, for d > 12, the baseline's AUC approaches 0.5, which 
corresponds to the accuracy of randomly chosen segmentations. 

The results of the simulations for varying power p in the non-stationary 
sources with D = 20, d s = 16 (no. of stat. sources) and d n = 4 are displayed 
in Figure [5] in the third panel. Both the performance of the Baseline and of 
the SSA preprocessing improves with increasing power change p. This effect is 
evident for lower p for the SSA preprocessing than for the baseline. 

An illustration of a case in which SSA preprocessing significantly outper- 
forms the baseline is displayed in Figure|4j The estimated non-stationary sources 
exhibit a far clearer illustration of the changepoints than the full dataset: the 
corresponding segmentation performances reflect this fact. 

3.4 Weighted CUSUM for changes in variance 

In statistical quality control CUSUM (or cumulative sum control chart) is a 
sequential analysis technique developed in 1954 [26 . CUSUM is one of the most 
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No. of non-stationary sources - d n No. of stat. sources - d Power change in the non-stationary sources - p 

Figure 5: Results of the simulations for Single Linkage Clustering with Sym- 
metrized Divergence (SLCD). The left panel displays the results for a fixed 
dimensionality of the time series, D — 30 and varying d ni the number of sta- 
tionary sources. The middle panel displays the results for a fixed number of non- 
stationary sources, d n = 2 and varying d s , the number of stationary sources. 
The right panel displays the results for fixed D = 20, d a — 16 and d n = 4 and 
for varying p, the power change in the non-stationary sources. Each displays 
the results in terms of the area under the ROC curve computed as per Section 
|3.2| The error bars extend from the 25th to the 75th percentiles. 



widely used and oldest methods for change point detection; the algorithm is an 
online method for change point detection based on a series of log-likelihood 
ratio tests. Thus CUSUM algorithm detects a change in parameter 9 of a 
process p$(y) and is asymtotically optimal when the pre-change and post- 
change parameters are known [3J. For the case in which the target value of the 
changing parameter is unknown, the weighted CUSUM algorithm is a direct 
extension of CUSUM [3J, by integrating over a parameter interval, as follows. 
The following statistics constitutes likelihood ratios between the currently 
estimated parameter of the non-stationary process and differing target values 
(values to which the parameter may change), integrated over a measure F. 

Aj = ( r Phpi^yjA dm) ) (11) 

Here yj,...,yk denote the timepoints lying inside a sliding window of length 
k whereby y^ indicates the latest time point received. The stopping time is then 
given as follows: 

t a = min{k : max{j < k : La(A*) > h}} (12) 

The function F serves as a weighting function for possible target values of 
the changed parameter. In principle the algorithm can thus be applied to multi- 
dimensional data. However, as per j3J, the extension of the CUSUM algorithm 
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to higher dimensions is non-trivial, not just because integrating over possible 
values of the covariance is computationally expensive but also because various 
parameterizations can lead to the same likelihood function. Given this we test 
the effectiveness of the algorithm in computing one-dimensional segmentations. 
In particular we compare the segmentation performed on the one dimensional 
projection chosen by SSA with the best segmentation of all individual dimen- 
sions with respect to hit-rate on each trial. In accordance with [3] we choose 
F to comprise a fixed uniform interval containing all possible values of the pro- 
cess's variance. We approximate the integral above as a sum over evenly spaced 
values on that interval. We approximate the stopping time by setting: 

t a « min{fc : ln(A%_ w+1 ) > h} (13) 

The exact details of our implementation are as follows. Let T be the number of 
data points in the data set X. 

1. We set the window size W, the sensitivity constant h and the current time 
step as t c = W + 1 and 8q = var({xi, ...,xw}) and 6 = {9i, . . . ,9 r } = 
{c,c + b,c + 2b, ... ,d}. 

2 A fc = 1 V r PSj(y^—'Vk) 

■ j b l^i=l pe (V],---,Vk) 

3. If ln(A^) > h then a changepoint is reported at time t c and t c is updated 
so that t c = t c + W and 9q = var({xt c _v^+i, • • • >Zt c })- We return to step 
2. 

4. Otherwise if < h no changepoint is reported and t c = t c + l. We return 
to step 2. 

3.4.1 Results 

In Figure [6j in the left panel, the results for varying numbers of stationary 
sources are displayed. Weighted CUSUM with SSA preprocessing significantly 
outperforms the baseline for all values of D (dimensionality of the time series) . 
Here we set d n = 1, the number of non-stationary sources, for all values of d s , 
the number of stationary sources. 

In Figure [6] the right panel, the results for changes in the power change be- 
tween ergodic sections p are displayed for D — 16, d s = 15 and d n — 1. SSA out- 
performs the baseline for all except very low values of p, the power level change, 
where all detection schemes fail. The simulations show that SSA represents 
a method for choosing a one dimensional subspace to render uni-dimensional 
segmentation methods applicable to higher dimensional datasets: the resulting 
segmentation method on the one dimensional derived non-stationary source will 
be simpler to parametrize and more efficient. If the true dimensionality of the 
non-stationary part is d n then no information loss should be observed. 
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Figure 6: Results of the simulations for CUSUM. The left panel displays the 
results for a fixed number of non-stationary sources, d n = 1 and varying d s , 
the number of stationary sources. The right panel displays the results for fixed 
D = 16 and d = 15 and for varying p, the power change in the non-stationary 
sources. Each displays the results in terms of the area under the ROC curve 
computed as per Section |3.2| The error bars extend from the 25th to the 75th 
percentiles. 



3.5 Kohlmorgen/Lemm Algorithm 

The Kohlmorgen/Lemm algorithm is a flexible non-parametric and multivariate 
method which may be applied in online and offline operation modes. Distinc- 
tive about the Kohlmorgen/Lemm algorithm is that a kernel density estimator, 
rather than a simple summary statistic, is used to estimate the occurence of 
changepoints. In particular the algorithm is based on a standard Kernel Den- 
sity Estimator with Gaussian kernels and estimation of the optimal segmenta- 
tion based on a Hidden Markov Model [TB] . More specifically if we estimate the 
densities on two arbitrary epochs Ei 7 Ej of our dataset X with Gaussian ker- 
nels then we can define a distance measure d between epochs via the L2-Norm 
yielding: 



w,v— 




The final segmentation is then based on the distance matrix generated be- 
tween epochs calculated with respect to the above distance measure d. As per 
the weighted CUSUM, it is possible to define algorithms whose sensitivity to 
distributional changes in reporting changepoints is related to the value of a pa- 
rameter C: C controls the probability of transitions to new states in the fitting 
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of the hidden markov model. However, in [T7] it is shown that in the case when 
all changepoints are known then one can also derive an algorithm which returns 
exactly that number of changepoints: in simulations we evaluate the perfor- 
mance on the first variant over a full range of parameters to obtain an ROC 
curve. In addition we choose the parameter a according to the rule of thumb 
given in [TB], which sets a proportional to the mean distance of each data point 
to its D nearest neighbours, where D is the dimensionality of the data: this is 
evaluated on a sample set. The exact implementation we test is based on the 
papers [17] and [18]. The details are as follows: 

1. The time series is divided into epochs 

2. A distance matrix is computed between epochs using kernel density esti- 
mation and the L2-norm as described above. 

3. The estimated density on each epoch corresponds to a state of the Markov 
Model. So a state sequence is a sequence of estimated densities. 

4. Finally, based on the estimated states and distance matrix, a hidden 
Markov model is fitted to the data and a change point reported when- 
ever consecutive epochs have been fitted with differing states. 

3.5.1 Results 

SSA preprocessing improves the segmentation obtained using the Kohlmor- 
gen/Lemm algorithm for all 3 setups of the dataset. In particular: the area 
under the ROC (AUC) for varying d s and fixed D are displayed in Figure [7] in 
the first panel, with D — 30. The area under the ROC (AUC) for varying d s and 
fixed d n are displayed in Figure [7j in the second panel, with d n = 2. The area 
under the ROC (AUC) for varying power change in the non-stationary sources 
p and fixed D, d are displayed in Figure [7] in the third panel, with p ranging 
between 1.1 and 4.0 at increments of 0.1. Of additional interest is that for vary- 
ing d n and fixed D the performance of segmentation with SSA preprocessing is 
superior for higher values of d s : this implies that the improvement of change 
point detection of the Kohlmorgen/Lemm algorithm due to the reduction in 
dimensionality to the informative estimated n-sources outweighs the difficulty 
of the problem of estimating the n-sources in the presence of a large number of 
noise dimensions. 

4 Application to Fault Monitoring 

In this section we apply our feature extraction technique to fault monitoring. 
The dataset consists of multichannel measurements of machine vibration. The 
machine under investigation is a pump, driven by an electromotor. The incoming 
shaft is reduced in speed by two delaying gear-combinations (a gear-combination 
is a combination of driving and a driven gear) . Measurements are repeated for 
two identical machines, where the first shows a progressed pitting in both gears, 
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Power change in stat. : 



Figure 7: Results of the simulations for the Kohlmorgen/Lemm.The left panel 
displays the results for a fixed dimensionality of the time series, D = 30 and 
varying d n , the number of stationary sources. The middle panel displays the 
results for a fixed number of non-stationary sources, d n — 2 and varying d s , 
the number of stationary sources. The right panel displays the results for fixed 
D = 20, d s = 16, d n = 4 and for varying p, the power change in the non- 
stationary sources. Each displays the results in terms of the area under the 



ROC curve computed as per Section 3.2 The error bars extend from the 25th 
to the 75th percentiles. 



and the second machine is virtually fault free. The rotating speed of the driving 
shaft is measured with a tachometer. 

The pump data set is semi-synthetic insofar as we juxtapose non-temporally 
consecutive sections of data between the two pump conditions. Sections of data 
from the first and second machine are spliced randomly (with respect to the 
time axis) together to yield a dataset with 10,000 time points in seven channels. 
An illustration of the dataset is displayed in Figure [5J 

4.1 Setup 

We preprocessed with SSA using a division of the dataset into 30 equally sized 
epochs and d estimated non-stationary sources, for d s , the no. of stationary 
sources ranging between 1 and 6, where D = 7 is the dimensionality of the 
dataset: subsequently we ran the KL algorithm on both the preprocessed and 
raw data using a window size of W — 50 and a separation of 50 datapoints 
between non-overlapping epochs. 

4.2 Parameter Choice 

To select the parameter, d s (and thus d n = D — d s ), the number of stationary 
sources, we use the following scheme: the measure of stationarity over which we 

3 The dataset can be downloaded free of charge at http://www.ph.tn.tudelft.nl/~ypma/ 
mechanical . html 
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Seven Channels including changepoints 






Figure 8: Pump Dataset: the machine under investigation is a pump, driven 
by an electromotor. The measurements made are of machine vibration at seven 
sensors. The data alternates between two conditions: normal functionality and 
pitting in both gears. 



optimize for SS A and SSA is given by the loss function in equation (2) . For each 
d = dim(V s ) we compute the estimated projection to the stationary sources us- 
ing SSA on the first half of the data available and computed this loss function 
on the estimated stationary sources on the second half and compared the result 
to the values of the loss function obtained on the dataset obtained by randomly 
permuting the time axis. This random permutation should produce, on average, 
a set of approximately stationary sources regardless of non-stationarity present 
in the estimated stationary sources for that d. In addition a measure of the in- 
formation relating to non-stationarity lost in choosing the number of stationary 
sources to be d s , the Baseline-Normalized Integral Stationary Error (BNISE), 
can be defined as followed: 



BNISE(d) 



E 

d'<d 



L dl (A-\X)--E xl (L dl (A- 1 ),X') 
a x ,(L d ,(A-\X>)) 



(14) 



Where Ld'iA^ 1 , X) denotes the loss function given in equationjijon the original 
dataset with stationary parameter r and L r (A~ 1 , X') the same measure on a 
random permutation X' of the same dataset. 



4.3 Results 

The results of this scheme and the segmentation are given in Figure|9] For d s = 6 
we observe a clearly visible difference between the expected loss function value 
due to small sample sizes and the loss function value present in the estimated 
stationary sources. Similarly, looking at the p-values, we observe that for d s — 
1, 2 we do not reject the hypothesis that the estimated s— sources are stationary, 
whereas for higher values of d s we reject this hypothesis. This implies that 
d n > 5. To test the effectiveness of this scheme, segmentation is evaluated for 
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SSA preprocessing at all possible values of d s . The AUC values obtained using 
the parameter choices d s — 1, ... ,6 for SSA preprocessing as compared to the 
baseline case are displayed in Figure [9] An increase in performance with SSA 
preprocessing is robust, as measured by the AUC values, with respect to varying 
choices for the parameter d s as long as d s is not chosen < 2. Note that, although, 
for the dataset at hand, there exists information relating to changepoints in the 
frequency spectrum taken over time, this information cannot be used to bring 
the baseline method onto a par with preprocessing with SSA. We display the 
results in figure [TTJfor comparison. Here, segmentation based on a 7-dimensional 
spectrogram based on each individual channel of the dataset is computed. The 
best performance over channels, for segmentation on each of these spectrograms 
is lower than the worst performance achieved on the entire dataset without using 
spectral information, with or without SSA. 

BNISE Obj. Func Value on Estimated s-Sources 




AUC with SSA w.r.t. D-d p-values s-sources 




1234567 1 2 3 4 5 6 

d d 

n s 



Figure 9: Pump Dataset: schemes for selecting the parameter d s . Top left: 
the measure BNISE for increasing values of d n . Top right the value of the 
error function as compared to randomly generated data. Bottom left: the AUC 
performances for various values of d n . Bottom left: p-values on the estimated 
s-sources. 
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With SSA Preprocessing: D-d = 2 




Real ChangePoints 
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Figure 10: Pump Datasct: all segmentations arc computed using Kohlmorgen/ 
Lemm with the number of changepoints TV specified . The baseline corresponds 
to segmentation without SSA preprocessing. The middle panel displays segmen- 
tation with SSA preprocessing. The bottom panel displays the real changepoints 
superimposed over the raw dataset. 

Performance via spectral method 
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Figure 11: Pump Dataset: performance on spectograms computed on individual 
channels of the datatset. Each spectrogram is computed with a window length 
of 50 datapoints and overlap of 49 datapoints. 7 frequency band windows are 
used to compute a timeseries of size 7 x 10, 000. 



5 Conclusion 

Unsupervised segmentation and identification of time series is a hard problem 
even in the univariate case and has received considerable attention in science 
and industry due to its broad applicability that ranges from process control and 
finance to biomedical data analysis. 

In high dimensional segmentation problems, different subsystems of the mul- 
tivariate time series may exhibit clearer and more informative signals for seg- 
mentation than others. The present contribution has harvested this property 
by decomposing the overall system into stationary and non-stationary parts by 
means of SSA and using the non-stationary subsystem to determine the seg- 
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mentation. 

Intuitively segmentation can be understood as clustering in a function space 
(in which the estimated potentials reside) and, as shown in this paper, SSA 
contributes by choosing the most appropriate function space, which is most 
informative for the purpose of segmentation. 

This generic approach is shown to yield excellent results in simulations, il- 
lustrating the novel framework for segmentation made available by SSA. We 
expect that the proposed dimensionality reduction will be useful on a wide 
range of datasets, because the task of discarding irrelevant stationary informa- 
tion is independent of the dataset-specific distribution within the informative 
non-stationary subspace. Moreover, the SSA preprocessing is a highly versatile 
tool because it can be combined with any subsequent segmentation method. 

Applications made along the same lines as in the present contribution are 
effective only when the non-stationary part of the data is visible in the mean 
and covariances. The present method may be thus made applicable to general 
datasets whose changes consist in the spectrum or temporal domain of the data 
by computing the score function as a further preprocessing step [3]. Future 
work will also focus on computing the projection to the non-stationary sources 
directly for data whose non-stationarity is more prominent in the spectrum than 
in the mean and covariance over time. 
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